1 BPF CO-RE (Compile-Once Run-Everywhere) concept

This section introduces eBPF, its early development tool BCC and the problems with BCC, and then the key technologies, compilation, program structure, and application life cycle of BPF CO-RE.

1.1 What is eBPF (extended Berkeley Packet Filter)?

An eBPF program is a piece of user-provided code that is injected straight into the kernel. Once loaded and verified, BPF programs execute in kernel context. These programs operate inside kernel memory space, with access to all the internal kernel state available to them.

Today, eBPF is used extensively to drive a wide variety of use cases: Providing high-performance networking and load-balancing in modern data centers and cloud native environments, extracting fine-grained security observability data at low overhead, helping application developers trace applications, providing insights for performance troubleshooting, preventive application and container runtime security enforcement, and much more.

[Figure: eBPF structure]

If we want a BPF program to be able to run in other environments (i.e., to be portable), the traditional approach is “on the fly” BPF compilation on the target host. The reasons are:

  • Accessing kernel structs (e.g., task_struct or sk_buff). Kernel types and data structures are in constant flux
  • Memory layout changes between versions/configurations
  • BPF code needs to be compiled with fixed offsets/sizes
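The layout problem in the list above can be reproduced in ordinary user-space C. The following is a minimal sketch, not kernel code: struct task_v1 and struct task_v2 are hypothetical, heavily simplified stand-ins for the same kernel struct in two different kernel builds, and read_int_at() plays the role of code that was compiled with a fixed offset.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a struct as laid out by an "older" kernel build. */
struct task_v1 {
	long state;
	int pid;
};

/* The same struct in a "newer" build: one field was added before pid. */
struct task_v2 {
	long state;
	long extra_flags;
	int pid;
};

/* Offset of pid that a program compiled against each layout would bake in. */
static size_t pid_offset_v1(void) { return offsetof(struct task_v1, pid); }
static size_t pid_offset_v2(void) { return offsetof(struct task_v2, pid); }

/* Read an int at a fixed byte offset, the way code with a hard-coded offset does. */
static int read_int_at(const void *base, size_t off)
{
	int value;

	memcpy(&value, (const char *)base + off, sizeof(value));
	return value;
}
```

Reading a struct task_v2 object with pid_offset_v1() returns bytes of extra_flags instead of pid, which is exactly the kind of silent mismatch that compiling on the target host is meant to avoid.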

For this reason, early BPF developers used the BCC toolkit.

1.1.1 BCC (BPF Compiler Collection) introduction and problem

With BCC, you embed your BPF program’s C source code into your user-space program (the control application) as a plain string. When the control application is eventually deployed and executed on the target host, BCC invokes its embedded Clang/LLVM, pulls in the local kernel headers (which you have to make sure are installed on the system from the correct kernel-devel package), and performs compilation on the fly. This ensures that the memory layout the BPF program expects is exactly the same as in the target host’s running kernel.

If you have to deal with some optional and potentially compiled-out stuff in kernel, you’ll just do #ifdef/#else guarding in your source code to accommodate such hazards as renamed fields, different semantics of values, or any optional stuff not available on current configuration. Embedded Clang will happily remove irrelevant parts of your code and will tailor BPF program code to specific kernel.

[Figure: BCC tools architecture]

This sounds great, doesn’t it? Not quite so, unfortunately. While this workflow works, it’s not without major drawbacks.

  • Every production machine needs kernel headers
    • kernel-devel package required.
    • kernel-devel is missing internal headers. If you need something from kernel that is not exposed through public headers – you’ll need to copy/paste type definitions into your BPF code by hand to get your work done;
    • Kernel developers often have to build and deploy custom one-off kernels as part of their development process. And without a custom-built kernel header package, no BCC-based application will work on such kernels, stripping developers of a useful set of tools for debugging and monitoring.
    • kernel-devel can get out of sync
  • LLVM/Clang is big and heavy
    • libbcc.so > 100MB, resulting in big fat binaries that need to be distributed with your application.
    • compilation is a heavy-weight process:
      • can use lots of memory and CPU, potentially tipping over a carefully balanced production workload
      • conversely, on a busy host, compiling even a small BPF program might take minutes in some cases
  • Testing and iteration are a pain
    • variety of kernel versions/configurations
    • “works on my machine” means nothing
    • Even trivial compile-time errors are detected only at run time, after you’ve rebuilt and restarted your user-space application completely; this significantly slows down development iteration.

Can we compile once and then run the same binary everywhere?

1.2 What is BPF CO-RE?

Libbpf + BPF CO-RE take a different approach. Their philosophy is that BPF programs are not much different from any “normal” user-space program: they ought to be compiled once into small binaries and then deployed unmodified, in a compact form, to target hosts. The goals are:

  • No kernel headers
  • No “on the fly” compilation
  • Upfront validation against production kernels

1.2.1 BPF CO-RE key ingredients

[Figure: relations between the BPF CO-RE key ingredients]

  • BTF (BPF Type Format)
    • It was created as a compact alternative to the more generic and verbose DWARF debug information.
    • It adds about 1.5 MB to the kernel image (tiny in comparison with DWARF debuginfo, which can be hundreds of MB).
    • It describes all kernel types (size, layout, etc.), so no kernel headers are required.
    • It is always in sync with the running kernel.
    • It allows lossless BTF-to-C conversion.
  • compiler (Clang)
    • To enable BPF CO-RE and let the BPF loader (i.e., libbpf) adjust a BPF program to the particular kernel running on the target host, Clang was extended with a few built-ins. They emit BTF relocations, which capture a high-level description of what pieces of information the BPF program code intended to read.
    • For example, if you were going to access task_struct->pid field, Clang would record that it was exactly a field named “pid” of type “pid_t” residing within a struct task_struct. This is done so that even if target kernel has a task_struct layout in which “pid” field got moved to a different offset within a task_struct structure (e.g., due to extra field added before “pid” field), or even if it was moved into some nested anonymous struct or union (and this is completely transparent in C code, so no one ever pays attention to details like that), you’ll still be able to find it just by its name and type information. This is called a field offset relocation. It is possible to capture (and subsequently relocate) not just a field offset, but other field aspects, like field existence or size.
  • user-space BPF loader library (libbpf)
    • Provided by kernel developers. It is used both to develop BPF programs and to load them, letting developers worry only about BPF program correctness and performance. You don’t have to embed Clang/LLVM into your application, and you don’t need to compile anything on the target host.
      • Libbpf takes a compiled BPF ELF (Executable and Linkable Format) object file, post-processes it as necessary, sets up various kernel objects (maps, programs, etc.), and triggers BPF program loading and verification.
      • Libbpf looks at BPF program’s recorded BTF type and relocation information and matches them to a BTF information provided by the running kernel. Libbpf resolves and matches all the types and fields, updates necessary offsets and other relocatable data as needed to make sure that BPF program’s logic is correctly functioning for a specific kernel on the host.
      • Provides Kconfig extern variables
        • Allow BPF programs to accommodate various kernel version- and configuration-specific changes. BPF program can define an extern variable with a well-known name (e.g., “LINUX_KERNEL_VERSION” to extract a running kernel version) or a name that matches one of Kconfig’s keys (e.g., “CONFIG_HZ” to get the value of HZ that kernel was built with) and libbpf will do its magic to set everything up in such a way that your BPF program can use such extern variables as any other global variable. These variables will have correct values, matching the active kernel your BPF program is executed in. Additionally, BPF verifier will track those variables as known constants and will be able to use them for advanced control flow analysis and dead code elimination.
      • Provides struct flavors
        • Helps with cases where different kernels have incompatible types, so that it’s impossible to compile a single BPF program for both kernels with a single common struct definition. Struct flavors allow you to have multiple alternative (and incompatible) definitions of the same kernel type in a single C program, to pick the most appropriate one at run time, and to type-cast to that struct flavor to extract the necessary fields, with libbpf performing the necessary relocations.
        • Without struct flavors, it would be impossible to really have a “compile once” program that could run on multiple kernels in cases like above. You’d need #ifdef’ed source code, compiled into two separate BPF program variants, with appropriate variant picked manually by control application in runtime. All this would be just unnecessary added complexity and pain.
  • bpftool
    • a useful tool that helps with BPF program development
      • bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
        • Get a compilable C header file with all kernel types, including those that are never exposed through headers provided by the kernel-devel package
      • bpftool gen skeleton <app>.bpf.o > <app>.skel.h
        • Generating BPF skeleton header file from compiled BPF object file for user-space program
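The field offset relocation described under the Clang and libbpf items can be pictured with a toy user-space model. This is only a sketch under stated assumptions: struct field_info is a made-up stand-in for the type information BTF carries, struct target_task is a hypothetical target-kernel layout, and resolve_field() imitates the idea of finding a field by name rather than by a compiled-in offset. Real libbpf matches BTF types and patches BPF instructions at load time, which this model does not attempt.

```c
#include <stddef.h>
#include <string.h>

/* Toy description of one struct field: a tiny stand-in for BTF. */
struct field_info {
	const char *name;
	size_t offset;
	size_t size;
};

/* Hypothetical layout of the struct on the target kernel. */
struct target_task {
	long state;
	long extra_flags;
	int pid;
};

/* What the "running kernel" reports about its own layout. */
static const struct field_info target_fields[] = {
	{ "state",       offsetof(struct target_task, state),       sizeof(long) },
	{ "extra_flags", offsetof(struct target_task, extra_flags), sizeof(long) },
	{ "pid",         offsetof(struct target_task, pid),         sizeof(int)  },
};

/* Find a field by name and size. Returns 0 and stores the target offset on
 * a match, or -1 when the target has no such field (a field-existence check). */
static int resolve_field(const struct field_info *fields, size_t count,
			 const char *name, size_t size, size_t *offset)
{
	for (size_t i = 0; i < count; i++) {
		if (strcmp(fields[i].name, name) == 0 && fields[i].size == size) {
			*offset = fields[i].offset;
			return 0;
		}
	}
	return -1;
}

/* Read pid through the resolved offset instead of a compile-time one. */
static int read_pid(const void *task, int *pid)
{
	size_t count = sizeof(target_fields) / sizeof(target_fields[0]);
	size_t off;

	if (resolve_field(target_fields, count, "pid", sizeof(int), &off) != 0)
		return -1;
	memcpy(pid, (const char *)task + off, sizeof(*pid));
	return 0;
}
```

Because the lookup is by name, the reader keeps working if the target layout gains or reorders fields; only the table changes, not the code that reads pid.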

You’ll end up with a small user-space binary that embeds the compiled BPF code through the BPF skeleton and has libbpf statically linked into it, so it doesn’t depend on system-wide libbpf availability. The result is a small (about 200 KB), fast, stand-alone binary that can be run everywhere. The BCC project has a collection of these, called libbpf-tools.

1.3 How does BPF CO-RE work?

1.3.1 BPF CO-RE compilation requirement

  • The kernel must be built with BTF type information. Build the kernel with the CONFIG_DEBUG_INFO_BTF=y option (this requires pahole >= 1.13 from the dwarves package). Some major Linux distributions ship kernels with BTF already built in:
    • Fedora 31+
    • RHEL 8.2+
    • OpenSUSE Tumbleweed (in the next release, as of 2020-06-04)
    • Arch Linux (from kernel 5.7.1.arch1-1)
    • Manjaro (from kernel 5.4 if compiled after 2021-06-18)
    • Ubuntu 20.10
    • Debian 11 (amd64/arm64)
  • Clang 10 or newer should work fine for most BPF features, but some more advanced BPF CO-RE features might require Clang 11 or even 12 (e.g., for some of the more recent and advanced CO-RE relocation built-ins).
  • libbpf and its dependencies:
    • zlib (libz-dev or zlib-devel package)
    • libelf (libelf-dev or elfutils-libelf-devel package)
  • bpftool

1.3.2 What are BPF maps?

A BPF map is the BPF abstraction for a data container. Many different things are modeled as BPF maps: from simple arrays and hash maps to per-socket and per-task local storage, BPF perf and ring buffers, and even some more exotic uses. The important thing is that most BPF maps allow looking up, updating, and deleting their elements by some key. BPF maps are the means of sharing state between (potentially many) BPF programs and user space. A map definition in a BPF program usually looks like this (there are a few exceptions, such as PERF_EVENT_ARRAY, STACK_TRACE, DEVMAP, CPUMAP, etc.):

struct {
    __uint(type, BPF_MAP_TYPE_<type>); /* ARRAY, HASH, PERCPU_ARRAY, ...*/
    __uint(max_entries, <max entry number>);
    __type(key, <key variable type>);
    __type(value, <value variable type>); /* if it's a struct, you can define it manually in <app>.h or choose one from vmlinux.h */
} <map name> SEC(".maps");
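One way to see why this declaration carries a map’s parameters without storing any data is to look at what the helper macros expand to. The sketch below is plain user-space C with stand-in macros: DEMO_UINT and DEMO_TYPE are hypothetical names modeled on libbpf’s __uint() and __type() (the real definitions live in bpf_helpers.h and may differ between libbpf versions), and struct event_count is a made-up value type. The integers are encoded as array lengths and the key/value types as pointer types, which is the kind of information type metadata such as BTF can record.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins modeled on libbpf's __uint() and __type() helper macros
 * (assumption: the real definitions may differ between libbpf versions). */
#define DEMO_UINT(name, val) int (*name)[val]
#define DEMO_TYPE(name, val) __typeof__(val) *name

/* Hypothetical value type that an <app>.h header might define. */
struct event_count {
	unsigned long long hits;
};

/* Same shape as the map template; 1 is the numeric value of
 * BPF_MAP_TYPE_HASH in the kernel's enum bpf_map_type. */
static struct {
	DEMO_UINT(type, 1);
	DEMO_UINT(max_entries, 1024);
	DEMO_TYPE(key, int);
	DEMO_TYPE(value, struct event_count);
} demo_map;

/* The parameters live purely in the field types, so they can be checked
 * at compile time without ever touching the object's contents. */
static_assert(sizeof(*demo_map.max_entries) / sizeof(int) == 1024,
	      "max_entries is encoded as an array length");

/* Recover the encoded values at run time in the same way. */
static size_t demo_map_type(void)        { return sizeof(*demo_map.type) / sizeof(int); }
static size_t demo_map_max_entries(void) { return sizeof(*demo_map.max_entries) / sizeof(int); }
static size_t demo_map_key_size(void)    { return sizeof(*demo_map.key); }
static size_t demo_map_value_size(void)  { return sizeof(*demo_map.value); }
```

In a real BPF program the struct would additionally be placed in the ".maps" section with SEC(), as in the template, so that the loader can find it.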

1.3.3 BPF CO-RE compilation and deployment

BPF CO-RE compilation process

  • Generating the vmlinux.h header file with all kernel types
  • Compiling your BPF program source code with a recent Clang (version 10 or newer) into a .o object file, and optionally stripping the unneeded DWARF information (for example, with llvm-strip -g)
  • Generating the BPF skeleton header file from the compiled BPF object file
  • Including the generated BPF skeleton header in the user-space code
  • Finally, compiling the user-space code, which gets the BPF object code embedded in it, so that you don’t have to distribute extra files with your application

[Figure: BPF CO-RE deployment process]

1.3.4 BPF CO-RE program file and structure

  • <app>.bpf.c: BPF C code that contains the logic to be executed in kernel context. Many BPF programs can be defined within the same BPF C file. They can have different types (i.e., SEC() annotations), and you can also define multiple BPF programs with the same SEC() attribute: libbpf will handle that just fine. All BPF programs defined within the same BPF C file share all the global state (global variables and BPF maps). This is frequently used to coordinate a few collaborating BPF programs.
    • Global variables: BPF code can read and update them just as user-space C code would with a global variable.
      • Adding const volatile marks the variable as read-only for both BPF code and user-space code. It can be set and modified from user space only before the BPF skeleton is loaded.
    • include vmlinux.h : This header contains all kernel types: those exposed as part of UAPI, internal types available through kernel-devel, and some more internal kernel types not available anywhere else. Unfortunately, BTF (as well as DWARF) doesn’t record #define macros, so some common macros might be missing with vmlinux.h. Most commonly missing ones might be provided as part of libbpf’s bpf_helpers.h (kernel-side “library”, provided by libbpf).
    • include bpf_helpers.h: provided by libbpf; it contains the most often used macros, constants, and BPF helper definitions, which are used by virtually every existing BPF application.
      • bpf_map_<operation>_elem(&some_map, &keyvar[, &valuevar][, args]) : Manipulate maps in kernel. Common operations include lookup, update, delete, push, pop, peek, etc.
      • The SEC() macro defines a BPF program that will be loaded into the kernel. The program is represented as a normal C function placed in a specially named section.
        • SEC("tp/syscalls/sys_enter_write") int handle_tp(void *ctx) { ... } defines a tracepoint BPF program, which is called each time the write() syscall is invoked from any user-space application.
        • char LICENSE[] SEC("license") defines the license of your BPF code. Specifying the license is mandatory and is enforced by the kernel. Some BPF functionality is unavailable to non-GPL-compatible code; the license strings the kernel treats as GPL-compatible are GPL, GPL v2, GPL and additional rights, Dual BSD/GPL, Dual MIT/GPL, and Dual MPL/GPL.
  • <app>.c: user-space C code, which loads BPF code and interacts with it throughout the lifetime of the application
    • include bpf.h: defines various userspace bpf helpers for working with BPF programs and maps
    • include libbpf.h: includes libbpf types and functions.
    • include <app>.skel.h: reflects the high-level structure of <app>.bpf.c. It also simplifies the BPF code deployment logistics by embedding contents of the compiled BPF object code inside the header file
      • Starting with Linux 5.5, global variables can be read and written from the user-space side:
        • skel->rodata for read-only variables
        • skel->bss for mutable zero-initialized variables
        • skel->data for non-zero-initialized mutable variables.
    • libbpf_set_print() installs a custom callback for all libbpf logs. This is extremely useful, especially during active development, because it lets you capture helpful libbpf debug logs.
    • Bump the RLIMIT_MEMLOCK limit. This raises the kernel’s internal per-user memory limit so that the BPF subsystem can allocate the resources needed for your BPF programs, maps, etc. You have to bump the RLIMIT_MEMLOCK limit one way or another; calling setrlimit(RLIMIT_MEMLOCK, ...) at the very beginning of your program is the simplest and most convenient way.
  • <app>.h (optional): a header file with common type definitions, shared by both the BPF and user-space code of the application.
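The RLIMIT_MEMLOCK step in the <app>.c list can be sketched with standard POSIX calls only. This is a minimal sketch, not the sample application’s actual code: bump_memlock_rlimit() is a hypothetical helper name, and raising the limit to RLIM_INFINITY is one common choice rather than a requirement. As far as I know, kernels from 5.11 on charge BPF memory to the memory cgroup instead, so this step matters mainly on older kernels.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>

/* Raise the locked-memory limit so that older kernels let the BPF
 * subsystem allocate memory for programs and maps.
 * Returns 0 on success, or -1 with errno set by setrlimit(); raising the
 * hard limit typically fails with EPERM for unprivileged processes. */
static int bump_memlock_rlimit(void)
{
	struct rlimit rlim_new = {
		.rlim_cur = RLIM_INFINITY,
		.rlim_max = RLIM_INFINITY,
	};

	if (setrlimit(RLIMIT_MEMLOCK, &rlim_new) != 0) {
		int saved_errno = errno;	/* fprintf() may overwrite errno */

		fprintf(stderr, "failed to raise RLIMIT_MEMLOCK: %s\n",
			strerror(saved_errno));
		errno = saved_errno;
		return -1;
	}
	return 0;
}
```

A user-space loader would call this once at the start of main(), before opening the BPF skeleton, and exit if it returns -1.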

1.3.5 BPF CO-RE application life cycle

BPF application typically goes through the following phases (Generated BPF skeleton has corresponding functions to trigger each phase):

  • open phase <-> obj = <name>__open(): BPF object file is parsed: BPF maps, BPF programs, and global variables are discovered, but not yet created. After a BPF app is opened, it’s possible to make any additional adjustments (setting BPF program types, if necessary; pre-setting initial values for global variables, etc), before all the entities are created and loaded.
  • load phase <-> err = <name>__load(obj): BPF maps are created, various relocations are resolved, BPF programs are loaded into the kernel and verified. At this point, all the parts of a BPF application are validated and exist in the kernel, but no BPF program is yet executed. After the load phase, it’s possible to set up initial BPF map state without racing with the BPF program code execution. You can combine the open and load phases with <name>__open_and_load() if you don’t need to adjust your BPF program between the open phase and the load phase.
  • Attachment phase <-> err = <name>__attach(obj): This is the phase at which BPF programs get attached to various BPF hook points (e.g., tracepoints, kprobes, cgroup hooks, network packet processing pipeline, etc.). This is the phase at which BPF programs start performing useful work and reading/updating BPF maps and global variables.
  • Tear down phase <-> <name>__destroy(obj): BPF programs are detached and unloaded from the kernel. BPF maps are destroyed and all the resources used by the BPF app are freed.

2 Preparing the Debian 11 build environment: upgrading from Debian 10

If you can install Debian 11 directly, do that and skip this section. This section describes how to upgrade Debian 10 to Debian 11 for cases where Debian 11 cannot be installed directly.

The Debian 11 kernel is configured by default with CONFIG_DEBUG_INFO=y and CONFIG_DEBUG_INFO_BTF=y (this produces /sys/kernel/btf/vmlinux, which is used to generate vmlinux.h).

2.1 Updating the repository list

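If no mirror has been configured on Debian 10 yet, first edit /etc/apt/sources.list: comment out the cdrom ... line, add the mirror location (for example, the Taiwan mirror http://ftp.tw.debian.org/debian), four lines in total, and add non-free. When you are done, the file should look like this: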

deb http://ftp.tw.debian.org/debian/ buster main contrib non-free
deb-src http://ftp.tw.debian.org/debian/ buster main contrib non-free

deb http://security.debian.org/debian-security buster/updates main
deb-src http://security.debian.org/debian-security buster/updates main

# buster-updates, previously known as 'volatile'
deb http://ftp.tw.debian.org/debian/ buster-updates main contrib non-free
deb-src http://ftp.tw.debian.org/debian/ buster-updates main contrib non-free

Next, run the following commands to switch the lists from Debian 10 to Debian 11:

sudo sed -i 's/buster/bullseye/g' /etc/apt/sources.list
sudo sed -i 's/buster/bullseye/g' /etc/apt/sources.list.d/*

Confirm that buster has been replaced with bullseye in the list:

deb http://ftp.tw.debian.org/debian/ bullseye main contrib non-free
deb-src http://ftp.tw.debian.org/debian/ bullseye main contrib non-free

deb http://security.debian.org/debian-security bullseye/updates main
deb-src http://security.debian.org/debian-security bullseye/updates main

# bullseye-updates, previously known as 'volatile'
deb http://ftp.tw.debian.org/debian/ bullseye-updates main contrib non-free
deb-src http://ftp.tw.debian.org/debian/ bullseye-updates main contrib non-free

Finally, change the Debian security entries and save the file:

deb http://ftp.tw.debian.org/debian/ bullseye main contrib non-free
deb-src http://ftp.tw.debian.org/debian/ bullseye main contrib non-free

deb https://deb.debian.org/debian-security bullseye-security main contrib
deb-src https://deb.debian.org/debian-security bullseye-security main contrib

# bullseye-updates, previously known as 'volatile'
deb http://ftp.tw.debian.org/debian/ bullseye-updates main contrib non-free
deb-src http://ftp.tw.debian.org/debian/ bullseye-updates main contrib non-free

2.2 Upgrading the system

Update the repository lists:

sudo apt update

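First perform a minimal upgrade. If a message about package restarts is displayed, press q to skip it; if you are asked whether services may be restarted during package upgrades without asking, you can choose “Yes” on the left to simplify the installation. This step takes about half an hour.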

sudo apt upgrade --without-new-pkgs

Next, perform the full upgrade. The prompts are similar to the ones above; this step takes about an hour and a half.

sudo apt full-upgrade

When it finishes, reboot. You can then remove old packages and cached files to free up a large amount of space:

sudo apt --purge autoremove
sudo apt autoclean

3 Compiling and verifying the BPF CO-RE sample code

3.1 Installing the build tools

Run the following commands to install the required packages and tools:

apt-get update
apt-get install clang build-essential bpftool git libbpf-dev

3.2 Compiling the sample code

First, get the sample code:

git clone https://github.com/sartura/ebpf-core-sample
cd ebpf-core-sample

Make the following modification to hello.bpf.c:

#include "vmlinux.h"
#define BPF_NO_GLOBAL_DATA 1
#include <bpf/bpf_helpers.h>

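After the modification, run the following commands in order and ignore any warning messages. You will end up with two executables, hello and maps.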

# Generate the vmlinux.h header file
bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h
# Compile hello.bpf.c into the hello.bpf.o object file
# gcc does not currently support compiling .bpf.c files (it may in the future)
# -g makes Clang emit the debug (BTF) information that CO-RE relies on
clang -g -O2 -target bpf -D__TARGET_ARCH_x86_64 -I . -c hello.bpf.c -o hello.bpf.o
# Generate the hello.skel.h header file from hello.bpf.o
bpftool gen skeleton hello.bpf.o > hello.skel.h
# Compile hello.c into the hello.o object file
clang -g -O2 -Wall -I . -c hello.c -o hello.o
# Get the libbpf source code
git clone https://github.com/libbpf/libbpf && cd libbpf/src/
# Build libbpf as a static library
make BUILD_STATIC_ONLY=1 OBJDIR=../build/libbpf DESTDIR=../build INCLUDEDIR= LIBDIR= UAPIDIR= install
# Go back to the original folder
cd ../../
# Link hello.o with the libbpf static library to produce the hello executable
# -lelf and -lz are libbpf's dependencies and must be passed to the compiler
clang -Wall -O2 -g hello.o libbpf/build/libbpf.a -lelf -lz -o hello
# Produce the maps executable in the same way
clang -g -O2 -target bpf -D__TARGET_ARCH_x86_64 -I . -c maps.bpf.c -o maps.bpf.o
bpftool gen skeleton maps.bpf.o > maps.skel.h
clang -g -O2 -Wall -I . -c maps.c -o maps.o
clang -Wall -O2 -g maps.o libbpf/build/libbpf.a -lelf -lz -o maps

3.3 Verifying portability

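Try running sudo ./hello and sudo ./maps on Debian 11; they print the next command or action to take. After confirming that they run correctly, move the whole folder to a Debian 10 machine and run the executables there directly. You should get the same result, which is exactly the spirit of CO-RE.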

4 References